Task 1.1 Understand how the H2O package supports explainability, using various plots of the relation between attributes and the defect prediction.
!pip install h2o
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: h2o in c:\users\hp\appdata\roaming\python\python311\site-packages (3.46.0.4)
Requirement already satisfied: requests in c:\programdata\anaconda3\lib\site-packages (from h2o) (2.31.0)
Requirement already satisfied: tabulate in c:\programdata\anaconda3\lib\site-packages (from h2o) (0.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests->h2o) (2024.2.2)
import h2o
from h2o.automl import H2OAutoML
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.24+7-LTS-271, mixed mode)
Starting server from C:\Users\hp\AppData\Roaming\Python\Python311\site-packages\h2o\backend\bin\h2o.jar
Ice root: C:\Users\hp\AppData\Local\Temp\tmpgvsxm1d5
JVM stdout: C:\Users\hp\AppData\Local\Temp\tmpgvsxm1d5\h2o_hp_started_from_python.out
JVM stderr: C:\Users\hp\AppData\Local\Temp\tmpgvsxm1d5\h2o_hp_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
| Property | Value |
|---|---|
| H2O_cluster_uptime: | 08 secs |
| H2O_cluster_timezone: | Asia/Kolkata |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.46.0.4 |
| H2O_cluster_version_age: | 25 days |
| H2O_cluster_name: | H2O_from_python_hp_go5fgq |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 1.979 Gb |
| H2O_cluster_total_cores: | 4 |
| H2O_cluster_allowed_cores: | 4 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://127.0.0.1:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.11.7 final |
# Import the software defect prediction dataset
f = "bug_pred.csv"
df = h2o.import_file(f)
# Response column
y = "defects"
# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)
# Explain leader model & compare with all AutoML models
exa = aml.explain(test)
# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)
# Explain a generic list of models
# use h2o.explain as follows:
# exl = h2o.explain(model_list, test)
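Note that `split_frame(ratios=[0.8], ...)` assigns rows probabilistically, so the split is approximately 80/20 rather than exact row counts. A plain-Python sketch of the same idea (the function name `split_frame` here is just an illustration, not H2O's implementation):

```python
import random

def split_frame(rows, ratio=0.8, seed=1):
    """Plain-Python analogue of a seeded, approximate train/test split."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row independently lands in train with probability `ratio`
        (train if rng.random() < ratio else test).append(row)
    return train, test

rows = list(range(100))
train, test = split_frame(rows)
print(len(train), len(test))  # roughly 80 / 20, not exactly
```

Because the assignment is per-row, two datasets of the same size can end up with slightly different train/test counts even under the same ratio.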
# 1. loc : numeric % McCabe's line count of code
# 2. v(g) : numeric % McCabe "cyclomatic complexity"
# 3. ev(g) : numeric % McCabe "essential complexity"
# 4. iv(g) : numeric % McCabe "design complexity"
# 5. n : numeric % Halstead total operators + operands
# 6. v : numeric % Halstead "volume"
# 7. l : numeric % Halstead "program length"
# 8. d : numeric % Halstead "difficulty"
# 9. i : numeric % Halstead "intelligence"
# 10. e : numeric % Halstead "effort"
# 11. b : numeric % Halstead's delivered-bugs estimate
# 12. t : numeric % Halstead's time estimator
# 13. lOCode : numeric % Halstead's line count
# 14. lOComment : numeric % Halstead's count of lines of comments
# 15. lOBlank : numeric % Halstead's count of blank lines
# 16. lOCodeAndComment: numeric
# 17. uniq_Op : numeric % unique operators
# 18. uniq_Opnd : numeric % unique operands
# 19. total_Op : numeric % total operators
# 20. total_Opnd : numeric % total operands
# 21. branchCount : numeric % branch count of the flow graph
# 22. defects : {false,true} % module has/has not one or more
# % reported defects
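Several of the Halstead attributes above are derived from the basic operator/operand counts. As an illustration (these are the standard Halstead formulas, not code from this notebook), the volume `v` can be recomputed from `uniq_Op`, `uniq_Opnd`, `total_Op`, and `total_Opnd`:

```python
import math

def halstead_volume(uniq_op, uniq_opnd, total_op, total_opnd):
    """Volume V = N * log2(n), where n is the vocabulary
    (unique operators + operands) and N the program length
    (total operators + operands)."""
    vocabulary = uniq_op + uniq_opnd
    length = total_op + total_opnd
    return length * math.log2(vocabulary)

# Hypothetical counts for one module
v = halstead_volume(uniq_op=10, uniq_opnd=6, total_op=40, total_opnd=24)
print(v)  # 64 * log2(16) = 256.0
```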
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress:
12:31:19.40: AutoML: XGBoost is not available; skipping it.
|███████████████████████████████████████████████████████████████| (done) 100%
Leaderboard
Leaderboard shows models with their metrics. When provided with an H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|---|
| GBM_4_AutoML_1_20240804_123118 | 0.805288 | 0.402924 | 0.381757 | 0.244162 | 0.344895 | 0.118952 | 209 | 0.096993 | GBM |
| GBM_grid_1_AutoML_1_20240804_123118_model_12 | 0.792926 | 0.427724 | 0.396849 | 0.255151 | 0.350326 | 0.122728 | 195 | 0.085532 | GBM |
| StackedEnsemble_AllModels_3_AutoML_1_20240804_123118 | 0.788805 | 0.36175 | 0.393796 | 0.258929 | 0.335485 | 0.11255 | 410 | 0.322775 | StackedEnsemble |
| GBM_5_AutoML_1_20240804_123118 | 0.787431 | 0.491831 | 0.331702 | 0.256868 | 0.364243 | 0.132673 | 177 | 0.073852 | GBM |
| StackedEnsemble_AllModels_1_AutoML_1_20240804_123118 | 0.787431 | 0.373368 | 0.374626 | 0.27919 | 0.339031 | 0.114942 | 288 | 0.183688 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_1_20240804_123118 | 0.786745 | 0.379763 | 0.341874 | 0.249313 | 0.343838 | 0.118225 | 249 | 0.134661 | StackedEnsemble |
| GBM_grid_1_AutoML_1_20240804_123118_model_1 | 0.785371 | 0.413049 | 0.417551 | 0.286401 | 0.345977 | 0.1197 | 164 | 0.059956 | GBM |
| StackedEnsemble_BestOfFamily_4_AutoML_1_20240804_123118 | 0.781937 | 0.350935 | 0.439612 | 0.269918 | 0.326603 | 0.10667 | 608 | 0.170749 | StackedEnsemble |
| GBM_grid_1_AutoML_1_20240804_123118_model_3 | 0.78125 | 0.383125 | 0.465615 | 0.255151 | 0.334459 | 0.111863 | 246 | 0.06207 | GBM |
| StackedEnsemble_AllModels_2_AutoML_1_20240804_123118 | 0.780563 | 0.357248 | 0.40619 | 0.23489 | 0.332808 | 0.110761 | 624 | 0.30774 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_3_AutoML_1_20240804_123118 | 0.773695 | 0.365955 | 0.347374 | 0.278846 | 0.337719 | 0.114054 | 252 | 0.151931 | StackedEnsemble |
| GBM_grid_1_AutoML_1_20240804_123118_model_8 | 0.772321 | 0.548396 | 0.295047 | 0.267857 | 0.372183 | 0.13852 | 203 | 0.052164 | GBM |
| GBM_grid_1_AutoML_1_20240804_123118_model_11 | 0.770948 | 0.40189 | 0.367915 | 0.277129 | 0.348621 | 0.121537 | 213 | 0.069689 | GBM |
| GBM_grid_1_AutoML_1_20240804_123118_model_7 | 0.770261 | 0.388116 | 0.378068 | 0.249657 | 0.341629 | 0.116711 | 112 | 0.064331 | GBM |
| GBM_grid_1_AutoML_1_20240804_123118_model_4 | 0.768201 | 0.383755 | 0.335872 | 0.275412 | 0.341224 | 0.116434 | 158 | 0.072048 | GBM |
| GBM_grid_1_AutoML_1_20240804_123118_model_10 | 0.768201 | 0.440429 | 0.275637 | 0.271635 | 0.361041 | 0.13035 | 215 | 0.050269 | GBM |
| GBM_3_AutoML_1_20240804_123118 | 0.76408 | 0.443622 | 0.310772 | 0.330701 | 0.356209 | 0.126885 | 209 | 0.089053 | GBM |
| GBM_grid_1_AutoML_1_20240804_123118_model_9 | 0.763393 | 0.411645 | 0.363103 | 0.269918 | 0.348777 | 0.121645 | 177 | 0.071796 | GBM |
| GLM_1_AutoML_1_20240804_123118 | 0.760646 | 0.382372 | 0.374458 | 0.253091 | 0.340037 | 0.115625 | 74 | 0.082809 | GLM |
| GBM_2_AutoML_1_20240804_123118 | 0.760646 | 0.422195 | 0.365288 | 0.277129 | 0.348829 | 0.121682 | 239 | 0.105836 | GBM |
[20 rows x 10 columns]
Confusion Matrix
Confusion matrix shows a predicted class vs an actual class.
GBM_grid_1_AutoML_1_20240804_123118_model_3
| Actual \ Predicted | false | true | Error | Rate |
|---|---|---|---|---|
| false | 73.0 | 18.0 | 0.1978 | (18.0/91.0) |
| true | 5.0 | 11.0 | 0.3125 | (5.0/16.0) |
| Total | 78.0 | 29.0 | 0.215 | (23.0/107.0) |
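The Error column is simply the misclassified count divided by the row total; a quick pure-Python check with the counts copied from the table above:

```python
# Confusion-matrix counts from the table: rows = actual, columns = predicted
matrix = {
    "false": {"false": 73, "true": 18},  # actual false
    "true":  {"false": 5,  "true": 11},  # actual true
}

def row_error(row, actual):
    """Fraction of rows of a given actual class that were misclassified."""
    wrong = sum(n for pred, n in row.items() if pred != actual)
    return wrong / sum(row.values())

err_false = row_error(matrix["false"], "false")            # 18 / 91
err_true = row_error(matrix["true"], "true")               # 5 / 16
total = sum(sum(r.values()) for r in matrix.values())      # 107
err_total = (18 + 5) / total                               # 23 / 107
print(round(err_false, 4), err_true, round(err_total, 4))  # 0.1978 0.3125 0.215
```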
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
Variable Importance Heatmap
Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
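A rough sketch of that aggregation idea (not H2O's exact internals; the column names and importance values below are hypothetical):

```python
from collections import defaultdict

def aggregate_importance(importances):
    """Collapse per-level importances of one-hot encoded columns
    (e.g. "weathersit.2") back onto the original column ("weathersit")."""
    agg = defaultdict(float)
    for name, imp in importances.items():
        base = name.split(".")[0]  # "weathersit.2" -> "weathersit"
        agg[base] += imp
    return dict(agg)

# Hypothetical one-hot importances from some model
imp = {"temp": 0.55, "weathersit.1": 0.10, "weathersit.2": 0.20, "weathersit.3": 0.05}
print(aggregate_importance(imp))  # weathersit levels sum to ~0.35
```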
Model Correlation
This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
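For classification, that agreement frequency between two models can be sketched in plain Python (toy predictions below):

```python
def agreement(preds_a, preds_b):
    """Fraction of rows on which two classifiers predict the same class."""
    assert len(preds_a) == len(preds_b)
    same = sum(a == b for a, b in zip(preds_a, preds_b))
    return same / len(preds_a)

m1 = ["true", "false", "false", "true", "false"]
m2 = ["true", "false", "true",  "true", "false"]
print(agreement(m1, m2))  # models agree on 4 of 5 rows -> 0.8
```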
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
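That additivity property is easy to verify for a linear model, where the exact Shapley value of feature i is w_i * (x_i - E[x_i]) and the bias term is the prediction at the feature means (the coefficients and data below are hypothetical):

```python
weights = [2.0, -1.0, 0.5]   # hypothetical linear-model coefficients
means = [1.0, 0.0, 4.0]      # background feature means, E[x_i]
x = [3.0, 2.0, 6.0]          # one instance to explain

def predict(row):
    return sum(w * v for w, v in zip(weights, row))

bias = predict(means)                                       # prediction at the means
phi = [w * (v - m) for w, v, m in zip(weights, x, means)]   # exact Shapley values
print(sum(phi) + bias, predict(x))                          # both 7.0 -- they match
```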
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
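A minimal pure-Python sketch of one-dimensional partial dependence, using a toy model in place of the fitted H2O model:

```python
def model(row):
    """Toy model standing in for the fitted H2O model."""
    return 3 * row[0] + row[1]

data = [[0.2, 5.0], [0.4, 1.0], [0.9, 3.0]]

def partial_dependence(model, data, feature_idx, grid):
    """For each grid value, force the feature to that value in every row
    and average the model's predictions."""
    pd = []
    for g in grid:
        preds = []
        for row in data:
            row = list(row)
            row[feature_idx] = g
            preds.append(model(row))
        pd.append(sum(preds) / len(preds))
    return pd

print(partial_dependence(model, data, 0, [0.0, 1.0]))  # [3.0, 6.0]
```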
# Task 1 & 2
# Try to explain the relation between attributes and the prediction on the Bike rental dataset
# Dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#
# You can use the code from the above blocks.
Task 1.2 Build an explainable ML model using the H2O package
bike_data_day = h2o.import_file("day.csv")
bike_data_hour = h2o.import_file("hour.csv")
print(bike_data_day)
print(bike_data_hour)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
1 2011-01-01 00:00:00 1 0 1 0 6 0 2 0.344167 0.363625 0.805833 0.160446 331 654 985
2 2011-01-02 00:00:00 1 0 1 0 0 0 2 0.363478 0.353739 0.696087 0.248539 131 670 801
3 2011-01-03 00:00:00 1 0 1 0 1 1 1 0.196364 0.189405 0.437273 0.248309 120 1229 1349
4 2011-01-04 00:00:00 1 0 1 0 2 1 1 0.2 0.212122 0.590435 0.160296 108 1454 1562
5 2011-01-05 00:00:00 1 0 1 0 3 1 1 0.226957 0.22927 0.436957 0.1869 82 1518 1600
6 2011-01-06 00:00:00 1 0 1 0 4 1 1 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
7 2011-01-07 00:00:00 1 0 1 0 5 1 2 0.196522 0.208839 0.498696 0.168726 148 1362 1510
8 2011-01-08 00:00:00 1 0 1 0 6 0 2 0.165 0.162254 0.535833 0.266804 68 891 959
9 2011-01-09 00:00:00 1 0 1 0 0 0 1 0.138333 0.116175 0.434167 0.36195 54 768 822
10 2011-01-10 00:00:00 1 0 1 0 1 1 1 0.150833 0.150888 0.482917 0.223267 41 1280 1321
[731 rows x 16 columns]
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
1 2011-01-01 00:00:00 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0 3 13 16
2 2011-01-01 00:00:00 1 0 1 1 0 6 0 1 0.22 0.2727 0.8 0 8 32 40
3 2011-01-01 00:00:00 1 0 1 2 0 6 0 1 0.22 0.2727 0.8 0 5 27 32
4 2011-01-01 00:00:00 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0 3 10 13
5 2011-01-01 00:00:00 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0 0 1 1
6 2011-01-01 00:00:00 1 0 1 5 0 6 0 2 0.24 0.2576 0.75 0.0896 0 1 1
7 2011-01-01 00:00:00 1 0 1 6 0 6 0 1 0.22 0.2727 0.8 0 2 0 2
8 2011-01-01 00:00:00 1 0 1 7 0 6 0 1 0.2 0.2576 0.86 0 1 2 3
9 2011-01-01 00:00:00 1 0 1 8 0 6 0 1 0.24 0.2879 0.75 0 1 7 8
10 2011-01-01 00:00:00 1 0 1 9 0 6 0 1 0.32 0.3485 0.76 0 8 6 14
[17379 rows x 17 columns]
#Consider cnt as the target column.
y = "cnt"
# Split into train & test
train_day, test_day = bike_data_day.split_frame(ratios=[0.8], seed=1)
train_hour, test_hour = bike_data_hour.split_frame(ratios=[0.8], seed=1)
1.2.C Try at least two models (AutoML and gradient boosting) for the Day dataset.
# Model 1: AutoML for Day Data
aml_day = H2OAutoML(max_runtime_secs=60, seed=1)
aml_day.train(y=y, training_frame=train_day)
AutoML progress:
17:18:34.999: AutoML: XGBoost is not available; skipping it.
|███████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_AllModels_3_AutoML_2_20240804_171834
| key | value |
|---|---|
| Stacking strategy | cross_validation |
| Number of base models (used / total) | 9/33 |
| # GBM base models (used / total) | 5/25 |
| # DRF base models (used / total) | 0/2 |
| # GLM base models (used / total) | 1/1 |
| # DeepLearning base models (used / total) | 3/5 |
| Metalearner algorithm | GLM |
| Metalearner fold assignment scheme | Random |
| Metalearner nfolds | 5 |
| Metalearner fold_column | None |
| Custom metalearner hyperparameters | None |
ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **
MSE: 1247.8413808717607
RMSE: 35.3247983840214
MAE: 27.206466046189394
RMSLE: 0.10124155347006471
Mean Residual Deviance: 1247.8413808717607
R^2: 0.9996719525303781
Null degrees of freedom: 581
Residual degrees of freedom: 572
Null deviance: 2213837175.773192
Residual deviance: 726243.6836673648
AIC: 5822.8216500055205
ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **
MSE: 14965.542740579755
RMSE: 122.33373508799507
MAE: 75.90850753331674
RMSLE: 0.14102933494854192
Mean Residual Deviance: 14965.542740579755
R^2: 0.996065679097662
Null degrees of freedom: 581
Residual degrees of freedom: 570
Null deviance: 2223467180.1638374
Residual deviance: 8709945.875017418
AIC: 7272.7047624599345
| Metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| aic | 1468.4111 | 113.0463640 | 1498.6598 | 1634.314 | 1356.6556 | 1483.7827 | 1368.6434 |
| loglikelihood | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mae | 75.73595 | 5.8450193 | 76.34347 | 80.4476 | 65.79099 | 79.54669 | 76.55099 |
| mean_residual_deviance | 14787.581 | 5008.3354 | 12510.049 | 22666.312 | 11087.105 | 16801.36 | 10873.079 |
| mse | 14787.581 | 5008.3354 | 12510.049 | 22666.312 | 11087.105 | 16801.36 | 10873.079 |
| null_deviance | 444693440.0000000 | 69633104.0000000 | 543586750.0000000 | 442103328.0000000 | 372113120.0000000 | 477883008.0000000 | 387780960.0000000 |
| r2 | 0.9960248 | 0.0014449 | 0.9972383 | 0.9935909 | 0.9966791 | 0.9958799 | 0.9967360 |
| residual_deviance | 1741989.1 | 680250.94 | 1501205.9 | 2833289.0 | 1219581.6 | 1948957.6 | 1206911.9 |
| rmse | 120.31822 | 19.720097 | 111.84833 | 150.55336 | 105.29533 | 129.62006 | 104.27406 |
| rmsle | 0.0873384 | 0.1171473 | 0.0379719 | 0.2965448 | 0.0255581 | 0.0326093 | 0.0440081 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
# Explanation for day data
exa_day = aml_day.explain(test_day)
Leaderboard
Leaderboard shows models with their metrics. When provided with an H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_3_AutoML_2_20240804_171834 | 104.146 | 10846.4 | 64.5129 | 0.0240478 | 10846.4 | 158 | 0.12652 | StackedEnsemble |
| GBM_grid_1_AutoML_2_20240804_171834_model_2 | 105.818 | 11197.4 | 76.3059 | 0.0303159 | 11197.4 | 284 | 0.023553 | GBM |
| StackedEnsemble_BestOfFamily_4_AutoML_2_20240804_171834 | 107.9 | 11642.3 | 66.9886 | 0.0256663 | 11642.3 | 124 | 0.035693 | StackedEnsemble |
| GBM_grid_1_AutoML_2_20240804_171834_model_12 | 110.276 | 12160.8 | 65.4705 | 0.0275127 | 12160.8 | 347 | 0.018439 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_5 | 110.577 | 12227.3 | 73.5664 | 0.0355693 | 12227.3 | 471 | 0.020661 | GBM |
| StackedEnsemble_AllModels_2_AutoML_2_20240804_171834 | 113.762 | 12941.7 | 73.5913 | 0.03425 | 12941.7 | 137 | 0.089221 | StackedEnsemble |
| StackedEnsemble_AllModels_1_AutoML_2_20240804_171834 | 114.336 | 13072.6 | 73.9909 | 0.0348475 | 13072.6 | 159 | 0.076312 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_3_AutoML_2_20240804_171834 | 115.386 | 13313.9 | 75.1365 | 0.0351491 | 13313.9 | 130 | 0.036973 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_2_20240804_171834 | 115.386 | 13313.9 | 75.1365 | 0.0351491 | 13313.9 | 129 | 0.038569 | StackedEnsemble |
| GBM_3_AutoML_2_20240804_171834 | 118.547 | 14053.3 | 75.8875 | 0.0350878 | 14053.3 | 327 | 0.031385 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_17 | 133.847 | 17915.1 | 88.1824 | 0.0389797 | 17915.1 | 275 | 0.025767 | GBM |
| GBM_2_AutoML_2_20240804_171834 | 153.785 | 23649.9 | 109.836 | 0.0572311 | 23649.9 | 299 | 0.019709 | GBM |
| GBM_4_AutoML_2_20240804_171834 | 165.463 | 27377.9 | 113.947 | 0.0612086 | 27377.9 | 368 | 0.039046 | GBM |
| GBM_5_AutoML_2_20240804_171834 | 171.761 | 29501.9 | 116.709 | 0.0632638 | 29501.9 | 277 | 0.022351 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_3 | 188.878 | 35675.1 | 131.436 | 0.0692319 | 35675.1 | 266 | 0.035318 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_4 | 189.021 | 35729.1 | 127.013 | 0.0555543 | 35729.1 | 478 | 0.029991 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_7 | 215.801 | 46570 | 154.323 | 0.080575 | 46570 | 467 | 0.022053 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_9 | 221.38 | 49009.2 | 150.878 | 0.0764 | 49009.2 | 374 | 0.023286 | GBM |
| GBM_grid_1_AutoML_2_20240804_171834_model_13 | 228.423 | 52177.1 | 164.21 | 0.084509 | 52177.1 | 674 | 0.026939 | GBM |
| XRT_1_AutoML_2_20240804_171834 | 235.211 | 55324.1 | 169.168 | 0.0889955 | 55324.1 | 349 | 0.012732 | DRF |
[20 rows x 9 columns]
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
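Residuals are simply actual minus fitted values; a tiny illustration with hypothetical fitted values (a healthy fit scatters them around zero):

```python
actual = [985, 801, 1349, 1562, 1600]   # cnt values from the day data preview
fitted = [990, 810, 1330, 1570, 1580]   # hypothetical model outputs

residuals = [a - f for a, f in zip(actual, fitted)]
print(residuals)                         # [-5, -9, 19, -8, 20]

# A mean residual near zero is necessary (but not sufficient) for a good fit;
# the plot is what reveals structure such as heteroscedasticity.
mean_residual = sum(residuals) / len(residuals)
print(mean_residual)                     # 3.4
```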
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
Variable Importance Heatmap
Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
Model Correlation
This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
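A toy model with an interaction term shows why ICE can reveal what the PDP averages away (pure-Python sketch, not H2O code):

```python
def model(row):
    return row[0] * row[1]       # toy interaction model

data = [[0.0, 1.0], [0.0, 3.0]]  # two instances differing in feature 1
grid = [0.0, 1.0, 2.0]

def ice(model, row, feature_idx, grid):
    """One ICE curve: vary a single feature for a single instance."""
    curve = []
    for g in grid:
        r = list(row)
        r[feature_idx] = g
        curve.append(model(r))
    return curve

ice_curves = [ice(model, row, 0, grid) for row in data]
print(ice_curves)  # [[0.0, 1.0, 2.0], [0.0, 3.0, 6.0]] - per-row slopes differ

# The PDP is the pointwise average of the ICE curves, hiding the interaction
pdp = [sum(c) / len(c) for c in zip(*ice_curves)]
print(pdp)         # [0.0, 2.0, 4.0]
```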
from h2o.estimators.gbm import H2OGradientBoostingEstimator
# Model 2: Gradient Boosting for Day Data
gbm_day = H2OGradientBoostingEstimator(seed=1)
gbm_day.train(y=y, training_frame=train_day)
gbm_day.explain(test_day)
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
1.2 AutoML model for hourly data
aml_hour = H2OAutoML(max_runtime_secs=60, seed=1)
aml_hour.train(y=y, training_frame=train_hour)
aml_hour.explain(test_hour)
AutoML progress:
17:44:11.723: AutoML: XGBoost is not available; skipping it.
|███████████████████████████████████████████████████████████████| (done) 100%
Leaderboard
Leaderboard shows models with their metrics. When provided with an H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_1_AutoML_3_20240804_174411 | 2.04977 | 4.20155 | 1.6105 | nan | 4.20155 | 765 | 0.048315 | StackedEnsemble |
| StackedEnsemble_AllModels_2_AutoML_3_20240804_174411 | 2.05192 | 4.21039 | 1.61032 | nan | 4.21039 | 768 | 0.049409 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_3_20240804_174411 | 2.05915 | 4.24011 | 1.63023 | nan | 4.24011 | 427 | 0.009716 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_3_AutoML_3_20240804_174411 | 2.0645 | 4.26216 | 1.63282 | nan | 4.26216 | 350 | 0.014528 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_1_AutoML_3_20240804_174411 | 2.08721 | 4.35647 | 1.66504 | nan | 4.35647 | 587 | 0.043062 | StackedEnsemble |
| GLM_1_AutoML_3_20240804_174411 | 3.08055 | 9.48976 | 2.1927 | nan | 9.48976 | 67 | 0.000697 | GLM |
| GBM_2_AutoML_3_20240804_174411 | 4.16465 | 17.3443 | 2.63865 | 0.0742366 | 17.3443 | 1098 | 0.008717 | GBM |
| GBM_1_AutoML_3_20240804_174411 | 5.7954 | 33.5867 | 3.22449 | 0.0617901 | 33.5867 | 5089 | 0.04262 | GBM |
| DeepLearning_1_AutoML_3_20240804_174411 | 6.6479 | 44.1946 | 4.79754 | nan | 44.1946 | 265 | 0.001382 | DeepLearning |
| GBM_3_AutoML_3_20240804_174411 | 8.05279 | 64.8474 | 5.291 | 0.154668 | 64.8474 | 1161 | 0.009942 | GBM |
| GBM_4_AutoML_3_20240804_174411 | 8.72024 | 76.0426 | 6.41598 | 0.305225 | 76.0426 | 1092 | 0.006098 | GBM |
| DRF_1_AutoML_3_20240804_174411 | 10.466 | 109.537 | 6.42336 | 0.129976 | 109.537 | 987 | 0.00191 | DRF |
| XRT_1_AutoML_3_20240804_174411 | 23.5206 | 553.22 | 13.4973 | 0.190477 | 553.22 | 450 | 0.001225 | DRF |
| GBM_5_AutoML_3_20240804_174411 | 36.0051 | 1296.37 | 27.16 | 0.740993 | 1296.37 | 194 | 0.003126 | GBM |
| GBM_grid_1_AutoML_3_20240804_174411_model_1 | 98.0476 | 9613.34 | 74.7003 | 1.18851 | 9613.34 | 237 | 0.002081 | GBM |
[15 rows x 9 columns]
Residual Analysis
Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Learning Curve Plot
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
Variable Importance Heatmap
Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
Model Correlation
This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
SHAP Summary
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
Individual Conditional Expectation
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
Leaderboard
Leaderboard shows models with their metrics. When provided with H2OAutoML object, the leaderboard shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings), otherwise it shows metrics computed on the frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_1_AutoML_3_20240804_174411 | 2.04977 | 4.20155 | 1.6105 | nan | 4.20155 | 765 | 0.048315 | StackedEnsemble |
| StackedEnsemble_AllModels_2_AutoML_3_20240804_174411 | 2.05192 | 4.21039 | 1.61032 | nan | 4.21039 | 768 | 0.049409 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_3_20240804_174411 | 2.05915 | 4.24011 | 1.63023 | nan | 4.24011 | 427 | 0.009716 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_3_AutoML_3_20240804_174411 | 2.0645 | 4.26216 | 1.63282 | nan | 4.26216 | 350 | 0.014528 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_1_AutoML_3_20240804_174411 | 2.08721 | 4.35647 | 1.66504 | nan | 4.35647 | 587 | 0.043062 | StackedEnsemble |
| GLM_1_AutoML_3_20240804_174411 | 3.08055 | 9.48976 | 2.1927 | nan | 9.48976 | 67 | 0.000697 | GLM |
| GBM_2_AutoML_3_20240804_174411 | 4.16465 | 17.3443 | 2.63865 | 0.0742366 | 17.3443 | 1098 | 0.008717 | GBM |
| GBM_1_AutoML_3_20240804_174411 | 5.7954 | 33.5867 | 3.22449 | 0.0617901 | 33.5867 | 5089 | 0.04262 | GBM |
| DeepLearning_1_AutoML_3_20240804_174411 | 6.6479 | 44.1946 | 4.79754 | nan | 44.1946 | 265 | 0.001382 | DeepLearning |
| GBM_3_AutoML_3_20240804_174411 | 8.05279 | 64.8474 | 5.291 | 0.154668 | 64.8474 | 1161 | 0.009942 | GBM |
| GBM_4_AutoML_3_20240804_174411 | 8.72024 | 76.0426 | 6.41598 | 0.305225 | 76.0426 | 1092 | 0.006098 | GBM |
| DRF_1_AutoML_3_20240804_174411 | 10.466 | 109.537 | 6.42336 | 0.129976 | 109.537 | 987 | 0.00191 | DRF |
| XRT_1_AutoML_3_20240804_174411 | 23.5206 | 553.22 | 13.4973 | 0.190477 | 553.22 | 450 | 0.001225 | DRF |
| GBM_5_AutoML_3_20240804_174411 | 36.0051 | 1296.37 | 27.16 | 0.740993 | 1296.37 | 194 | 0.003126 | GBM |
| GBM_grid_1_AutoML_3_20240804_174411_model_1 | 98.0476 | 9613.34 | 74.7003 | 1.18851 | 9613.34 | 237 | 0.002081 | GBM |
[15 rows x 9 columns]
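As a quick sanity check on the table above: the metrics are related, since mse = rmse², and for a Gaussian response the mean residual deviance equals the MSE, which is why those two columns match row for row.

```python
import math

# Check the metric relationships against the top leaderboard row:
# mse = rmse**2, and mean_residual_deviance == mse for a Gaussian response.
rmse, mse, mean_residual_deviance = 2.04977, 4.20155, 4.20155
assert math.isclose(rmse ** 2, mse, rel_tol=1e-4)
assert mse == mean_residual_deviance
print("rmse^2 =", round(rmse ** 2, 5))
```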
Residual Analysis
Residual Analysis plots the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer-valued (vs. real-valued) response variable.
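A small sketch of the kind of pattern this plot exposes: fitting a straight line to a quadratic response leaves residuals with a clear U-shape instead of a random scatter (data and model are made up).

```python
import numpy as np

# Sketch of residual analysis: fit a straight line to quadratic data and show
# the residuals are structured rather than random - the kind of pattern the
# Residual Analysis plot makes visible when the model is too simple.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
y = x ** 2 + rng.normal(scale=0.01, size=x.size)   # truly quadratic response

slope, intercept = np.polyfit(x, y, 1)             # deliberately too simple a model
fitted = slope * x + intercept
residuals = y - fitted

# residuals vs fitted form a U-shape: positive at the ends, negative mid-range
print("residuals at the ends:", round(residuals[0], 3), round(residuals[-1], 3))
print("residual mid-range   :", round(residuals[50], 3))
```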
Learning Curve Plot
The learning curve plot shows the loss function/metric as a function of the number of iterations (or the number of trees, for tree-based algorithms). This plot can be useful for determining whether the model overfits.
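A sketch of reading such a curve, with made-up loss values: the training loss keeps falling while the validation loss bottoms out and rises again, and the minimum of the validation curve marks where overfitting begins.

```python
import numpy as np

# Sketch of reading a learning curve: training loss keeps falling while the
# validation loss bottoms out and rises again; the elbow marks where the
# model starts to overfit. Both loss curves here are made-up numbers.
train_loss = np.array([5.0, 3.0, 2.0, 1.5, 1.2, 1.0, 0.8, 0.6])
valid_loss = np.array([5.2, 3.4, 2.5, 2.1, 2.0, 2.1, 2.4, 2.9])

best_iter = int(np.argmin(valid_loss))
print("best iteration:", best_iter)                              # validation minimum
print("overfitting after it:", bool(valid_loss[-1] > valid_loss[best_iter]))
```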
Variable Importance
The variable importance plot shows the relative importance of the most important variables in the model.
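One way to make this notion concrete is permutation importance (a generic technique, not necessarily the measure H2O's plot uses): shuffle one column and see how much the model's error grows. A toy sketch with an exactly known model:

```python
import numpy as np

# Sketch of permutation importance: shuffle one column, measure how much the
# error grows relative to the unshuffled baseline. Features the model relies
# on more heavily hurt more when shuffled; an unused feature scores zero.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1]                  # x2 is irrelevant

def mse(a, b):
    return float(np.mean((a - b) ** 2))

predict = lambda M: 3.0 * M[:, 0] + 1.0 * M[:, 1]  # "the model" (exact here)
base = mse(y, predict(X))

importance = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(mse(y, predict(Xp)) - base)

print([round(v, 2) for v in importance])  # x0 scores highest, x2 scores 0
```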
Task -1.2.D Discuss how the various plots explain how the number of bikes rented corresponds to the various environmental conditions.
The variable importance plot shows how different features (variables) contribute to the prediction of the number of bike rentals (cnt). Below are the observations from the generated graphs:
Among all the available variables, the key ones appear to relate to the user type: whether the user is registered or casual. Registered users show a high likelihood of booking a bike, while attracting casual users appears more challenging, although casual users do make bookings as well.
- From a business perspective, the service should try to attract users and get them registered for future bookings; this will help the business secure further bookings.
Weather also plays an important role in bike rentals, as humidity and temperature affect a person's preference for choosing a bike over a taxi or car. As per the data, the impact of the different weather factors on bike rentals is as follows:
a. Temperature: As per the data, the bike rental count increases as the temperature rises. Both temp (actual temperature) and atemp (apparent temperature) are important, but apparent temperature can be a better indicator, as it has more influence on the data and reflects human perception.
b. Humidity: High humidity can make biking uncomfortable, leading to a decrease in bike rentals as humidity increases, which is reflected in the data. From a business perspective, on high-humidity days the service can offer discounts or special deals to encourage usage.
c. Windspeed: High wind speeds can deter biking due to safety concerns and physical difficulty, and they reduce bike rentals. On windy days, bike-sharing companies can alert users and advise caution; to sustain rentals and keep customers satisfied, the business can reduce pricing or provide alternative transportation options.
d. Weather Situation: Clear and partly cloudy conditions (weathersit = 1) are ideal for biking, resulting in higher rentals, while adverse weather conditions (weathersit = 3 or 4) significantly reduce bike rentals. The business insight: on clear days, bike-sharing services can expect higher demand and prepare accordingly, and on adverse-weather days they can offer alternative promotions to retain user engagement.